Meta Data Cleaning

Database connection: First, we connect to the database with pymysql. The connection is managed by a Database class defined in database.py.


In [1]:
from database import Database
database = Database(
    '<host name>',
    '<database name>',
    '<user name>',
    '<password>',
    'utf8mb4'
)
connection = database.connect_with_pymysql()
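
The Database class itself is not shown in this notebook. The following is only a minimal sketch of what such a wrapper might look like, assuming the constructor takes its arguments in the order used above (host, database name, user, password, charset) and that connect_with_pymysql() returns None on failure. The attribute names and the error handling are assumptions, not the actual contents of database.py.

```python
class Database(object):
    """Hypothetical connection wrapper; the real database.py may differ."""

    def __init__(self, host, db_name, user, password, charset="utf8mb4"):
        # Store the connection settings; nothing is opened yet.
        self.host = host
        self.db_name = db_name
        self.user = user
        self.password = password
        self.charset = charset

    def connect_with_pymysql(self):
        """Return an open pymysql connection, or None if connecting fails."""
        # Imported here so the class can be defined without the driver installed.
        import pymysql
        try:
            return pymysql.connect(
                host=self.host,
                user=self.user,
                password=self.password,
                database=self.db_name,
                charset=self.charset,
            )
        except pymysql.MySQLError as error:
            print("Could not connect: " + str(error))
            return None
```

Returning None on failure is one way to make a later `if connection:` check meaningful.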

Next, we import the questions from the database and clean them. The cleaning consists of the following steps:
Step 1: Decode the data and remove special characters
Step 2: Remove punctuation marks
Step 3: Remove extra whitespace

We then write the cleaned data back to the database and, finally, close the database connection.
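
To make steps 2 and 3 concrete, here is a small standalone sketch in Python 3. The function names are hypothetical and the logic is an assumption about what a punctuation remover and a whitespace remover typically do. It is not the code of the Cleaner class in preprocessor.py, and the decoding in step 1 is left out because its details are not described here.

```python
import re
import string


def remove_punctuation(text):
    """Replace every ASCII punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_extra_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean(text):
    """Apply step 2, then step 3."""
    return remove_extra_whitespace(remove_punctuation(text))


sample = "  How do I   parse JSON, quickly?  "
print(clean(sample))  # How do I parse JSON quickly
```

Punctuation is replaced with a space rather than deleted so that words joined only by punctuation (for example "a,b") stay separate. The whitespace step then removes the extra spaces this creates.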


In [2]:
from preprocessor import Decoder, Cleaner
# decoder instance
decoder = Decoder()
    
if connection:
    try:
        with connection.cursor() as cursor:
            # example: decode all questions
            for data in decoder.decode_in_range(cursor, 'questions', 'body', 1, 99478):
                if data:
                    if all(data):
                        try:
                            # Step 2: remove punctuation marks
                            cleaned_data = Cleaner.punctuation_remover(data[1])
                            # Step 3: remove extra whitespace
                            cleaned_data = Cleaner.whitespace_remover(cleaned_data)
                            # Parameterized query: pymysql escapes the values,
                            # so quotes in the text cannot break the statement
                            sql = "UPDATE questions SET body=%s WHERE id=%s"
                            cursor.execute(sql, (cleaned_data, data[0]))
                            connection.commit()
                        except Exception as e:
                            print("Exception in updating id " + str(data[0]) + ": " + str(e))
    finally:
        connection.close()

After this step, the database holds the cleaned data, which we will use for our further analysis.